# Why Big Data Matters

## Reference

- When Do You Need Billions of Words of Pretraining Data?

## Data Preparation

- The importance of deduplicating training data: Deduplicating Training Data Makes Language Models Better
- Given a fixed compute budget, what is the optimal balance — large model with little data, small model with lots of data, or a medium model with a medium amount of data? Training Compute-Optimal Large Language Models
- LLaMA also follows this recipe: LLaMA: Open and Efficient Foundation Language Models
- Scaling Instruction-Finetuned Language Models
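To illustrate the deduplication point, here is a minimal sketch of exact-duplicate removal by hashing normalized text. The function name is hypothetical; the paper above also removes *near*-duplicates (using suffix arrays and MinHash), which this simple sketch does not cover:

```python
import hashlib

def dedup_exact(docs):
    """Drop exact duplicate documents, keeping the first occurrence.

    Documents are compared after whitespace stripping and lowercasing;
    a SHA-256 digest of the normalized text serves as the dedup key.
    """
    seen = set()
    unique = []
    for doc in docs:
        key = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

print(dedup_exact(["Hello world", "hello world ", "goodbye"]))
```

Hashing keeps memory bounded by one digest per unique document instead of storing full texts, which matters at pretraining-corpus scale.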
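The compute-optimal trade-off from the Chinchilla paper can be sketched numerically. This assumes the common rule of thumb derived from that paper — training cost C ≈ 6·N·D FLOPs and roughly 20 training tokens per parameter — so the helper below is an approximation, not the paper's exact fitted scaling law:

```python
def chinchilla_optimal(flops):
    """Split a FLOP budget into (parameters, tokens) under the
    rule of thumb C ≈ 6 * N * D with D ≈ 20 * N."""
    n = (flops / (6 * 20)) ** 0.5  # optimal parameter count N
    d = 20 * n                     # optimal training tokens D
    return n, d

# Chinchilla itself: 70B parameters trained on 1.4T tokens.
n, d = chinchilla_optimal(6 * 70e9 * 1.4e12)
print(f"N ≈ {n:.2e} params, D ≈ {d:.2e} tokens")
```

Plugging Chinchilla's own budget back in recovers roughly 70B parameters and 1.4T tokens, i.e. far more data per parameter than earlier models of similar size used.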